from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.preprocessing import LabelEncoder
from sklearn.metrics import adjusted_rand_score

eda = pd.read_parquet("data/eda.parquet")
---------------------------------------------------------------------------
ImportError                               Traceback (most recent call last)
Cell In[1], line 8
      6 from sklearn.preprocessing import LabelEncoder
      7 from sklearn.metrics import adjusted_rand_score
----> 8 eda = pd.read_parquet("data/eda.parquet")

File /opt/hostedtoolcache/Python/3.11.12/x64/lib/python3.11/site-packages/pandas/io/parquet.py:651, in read_parquet(path, engine, columns, storage_options, use_nullable_dtypes, dtype_backend, filesystem, filters, **kwargs)
    498 @doc(storage_options=_shared_docs["storage_options"])
    499 def read_parquet(
    500     path: FilePath | ReadBuffer[bytes],
   (...)
    508     **kwargs,
    509 ) -> DataFrame:
    510     """
    511     Load a parquet object from the file path, returning a DataFrame.
    512
   (...)
    648     1    4    9
    649     """
--> 651     impl = get_engine(engine)
    653     if use_nullable_dtypes is not lib.no_default:
    654         msg = (
    655             "The argument 'use_nullable_dtypes' is deprecated and will be removed "
    656             "in a future version."
    657         )

File /opt/hostedtoolcache/Python/3.11.12/x64/lib/python3.11/site-packages/pandas/io/parquet.py:67, in get_engine(engine)
     64     except ImportError as err:
     65         error_msgs += "\n - " + str(err)
---> 67 raise ImportError(
     68     "Unable to find a usable engine; "
     69     "tried using: 'pyarrow', 'fastparquet'.\n"
     70     "A suitable version of "
     71     "pyarrow or fastparquet is required for parquet "
     72     "support.\n"
     73     "Trying to import the above resulted in these errors:"
     74     f"{error_msgs}"
     75 )
     77 if engine == "pyarrow":
     78     return PyArrowImpl()

ImportError: Unable to find a usable engine; tried using: 'pyarrow', 'fastparquet'.
A suitable version of pyarrow or fastparquet is required for parquet support.
Trying to import the above resulted in these errors:
- Missing optional dependency 'pyarrow'. pyarrow is required for parquet support. Use pip or conda to install pyarrow.
- Missing optional dependency 'fastparquet'. fastparquet is required for parquet support. Use pip or conda to install fastparquet.
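
The ImportError above says that neither parquet engine was importable in the environment that rendered this page, so `eda` was never created and every later cell fails with a NameError. A minimal sketch of a pre-flight check follows, using only the standard library. The helper name `available_parquet_engines` is a hypothetical name chosen for illustration; it is not part of pandas.

```python
from importlib.util import find_spec

# Engines pandas reports trying for parquet I/O, per the error message above.
PARQUET_ENGINES = ("pyarrow", "fastparquet")


def available_parquet_engines(candidates=PARQUET_ENGINES):
    """Return the candidate packages that can be imported in this environment.

    Hypothetical helper for illustration; it only checks importability.
    """
    return [name for name in candidates if find_spec(name) is not None]


engines = available_parquet_engines()
if engines:
    print(f"Parquet engines found: {', '.join(engines)}")
else:
    print("No parquet engine found; install one, e.g. `pip install pyarrow`.")
```

Running a check like this before `pd.read_parquet` makes the missing dependency visible up front instead of surfacing as a chain of NameErrors further down.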
---------------------------------------------------------------------------
NameError                                 Traceback (most recent call last)
Cell In[2], line 1
----> 1 features = eda[['SALARY', 'MAX_YEARS_EXPERIENCE', 'MIN_YEARS_EXPERIENCE']].copy()
      3 for col in ['MAX_YEARS_EXPERIENCE', 'MIN_YEARS_EXPERIENCE', 'SALARY']:
      4     features[col] = pd.to_numeric(features[col], errors='coerce')

NameError: name 'eda' is not defined
import plotly.express as px
import plotly.graph_objects as go
from IPython.display import HTML

# 1) Build the DataFrame
df_plot = features.copy()
df_plot['Cluster'] = eda.loc[features.index, 'Cluster']

# 2) Compute centroids in original units
centroids = kmeans.cluster_centers_
centroids_x = centroids[:, 0] * X.std(axis=0)[0] + X.mean(axis=0)[0]
centroids_y = centroids[:, 1] * X.std(axis=0)[1] + X.mean(axis=0)[1]

# 3) Create an interactive Plotly Figure
fig = px.scatter(
    df_plot,
    x='SALARY',
    y='MAX_YEARS_EXPERIENCE',
    color='Cluster',
    title="KMeans Clustering by Salary and Max Years Experience",
    labels={
        'SALARY': 'Salary',
        'MAX_YEARS_EXPERIENCE': 'Max Years Experience',
        'Cluster': 'Cluster'
    },
    width=800,
    height=500,
)

# 4) Add centroid traces
fig.add_trace(
    go.Scatter(
        x=centroids_x,
        y=centroids_y,
        mode='markers',
        marker=dict(symbol='x', size=18, color='black', line=dict(width=2, color='white')),
        name='Centroids'
    )
)

fig.write_html("figures/analytics_plot1.html", include_plotlyjs="cdn", full_html=True)
fig
---------------------------------------------------------------------------
NameError                                 Traceback (most recent call last)
Cell In[3], line 6
      3 from IPython.display import HTML
      5 # 1) Build the DataFrame
----> 6 df_plot = features.copy()
      7 df_plot['Cluster'] = eda.loc[features.index, 'Cluster']
      9 # 2) Compute centroids in original units

NameError: name 'features' is not defined
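
The plotting cell maps each centroid from standardized units back to original units by multiplying by the column standard deviation and adding the column mean. The sketch below illustrates that back-transformation for a single column, using only the standard library and made-up salary values (they are not taken from the dataset). One caveat worth checking in the real notebook: `StandardScaler` scales by the population standard deviation (ddof=0), while pandas `.std()` defaults to the sample standard deviation (ddof=1), so the manual formula is only exact when the same definition is used on both sides. If a fitted scaler object is available, its `inverse_transform` method avoids the question entirely.

```python
from statistics import fmean, pstdev

# Made-up salaries, for illustration only (not from the dataset).
salaries = [60_000.0, 80_000.0, 120_000.0, 200_000.0]

mu = fmean(salaries)
sigma = pstdev(salaries)  # population std (ddof=0), the definition StandardScaler uses

# Standardize: z = (x - mean) / std
z = [(x - mu) / sigma for x in salaries]

# A KMeans centroid is the mean of its members in standardized space.
# Assume, for illustration, that the first two points form one cluster.
centroid_z = fmean(z[:2])

# Back-transform to original units: x = z * std + mean
centroid_original = centroid_z * sigma + mu

print(f"centroid (standardized):   {centroid_z:.3f}")
print(f"centroid (original units): {centroid_original:,.0f}")
```

Because standardization is a linear map, the back-transformed centroid equals the plain mean of the cluster's original values, which here is the mean of 60,000 and 80,000.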
The clustering produces four groups. Cluster 0 (green) has lower salaries, mostly under $150k, with 2–5 years of maximum experience; these are likely junior to mid-level employees with moderate pay. Cluster 1 (orange) has medium to high salaries spanning a wide range, roughly $100k–$500k, within a narrow experience band of about 3 years, which suggests specialized or high-paying roles with short experience, possibly fast-track promotions or high-demand fields. Cluster 2 has low salaries and 0–4 years of experience, consistent with entry-level employees. Cluster 3 has medium salaries, mostly under $200k, with more experience (6–13 years); these are probably senior professionals who have more experience but not the highest salaries.
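
Ranges like the ones quoted above are easier to verify from a per-cluster summary than from the scatter plot alone. The sketch below computes the minimum, median, and maximum per cluster with the standard library. The rows are made up for illustration and the function name `summarize` is hypothetical. With the notebook's DataFrame, and assuming the `Cluster` column exists as the plotting cell implies, a roughly equivalent call would be `eda.groupby('Cluster')[['SALARY', 'MAX_YEARS_EXPERIENCE']].agg(['min', 'median', 'max'])`.

```python
from collections import defaultdict
from statistics import median

# Made-up (cluster, salary, max_years_experience) rows, for illustration only.
rows = [
    (0, 95_000, 3), (0, 140_000, 5), (0, 110_000, 2),
    (2, 55_000, 1), (2, 70_000, 4),
    (3, 150_000, 8), (3, 180_000, 13),
]


def summarize(rows):
    """Return {cluster: {"salary": (min, median, max), "max_exp": (min, median, max)}}."""
    salary, exp = defaultdict(list), defaultdict(list)
    for cluster, s, e in rows:
        salary[cluster].append(s)
        exp[cluster].append(e)
    return {
        c: {
            "salary": (min(salary[c]), median(salary[c]), max(salary[c])),
            "max_exp": (min(exp[c]), median(exp[c]), max(exp[c])),
        }
        for c in sorted(salary)
    }


summary = summarize(rows)
for cluster, stats in summary.items():
    print(cluster, stats)
```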
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, r2_score
import plotly.graph_objects as go

# Prepare features & target
features = eda[['MIN_YEARS_EXPERIENCE', 'MAX_YEARS_EXPERIENCE']].apply(pd.to_numeric, errors='coerce')
features = features.dropna()
X = features
y = eda.loc[X.index, 'SALARY']

# Train/test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=688)

# Fit model & predict
model = LinearRegression()
model.fit(X_train, y_train)
y_pred = model.predict(X_test)

# Metrics (optional, but handy)
mse = mean_squared_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)
print(f"MSE: {mse:.2f}, R²: {r2:.3f}")

# Define min/max for the identity line
min_val = y_test.min()
max_val = y_test.max()
---------------------------------------------------------------------------
NameError                                 Traceback (most recent call last)
Cell In[4], line 8
      5 import plotly.graph_objects as go
      7 # Prepare features & target
----> 8 features = eda[['MIN_YEARS_EXPERIENCE', 'MAX_YEARS_EXPERIENCE']].apply(pd.to_numeric, errors='coerce')
      9 features = features.dropna()
     10 X = features

NameError: name 'eda' is not defined
This plot shows actual vs. predicted salary for a multiple linear regression model that uses minimum and maximum years of experience as predictors. The blue dots represent individual predictions on the test set, and the red dashed line is the identity line where predicted equals actual. Most points lie close to the red line, which indicates that the model predicts salary accurately, with small errors and a strong linear fit; this should be reflected in an R² score near 1.0 in the metrics printed by the cell.
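
To make the link between the identity line and the reported metrics concrete, the sketch below computes MSE and R² by hand with the standard library. The salary values are made up for illustration, and the function names `mse` and `r2` are hypothetical stand-ins that follow the usual single-output definitions of mean squared error and the coefficient of determination, R² = 1 − SS_res / SS_tot.

```python
from statistics import fmean


def mse(y_true, y_pred):
    """Mean squared error: the average of the squared residuals."""
    return fmean((t - p) ** 2 for t, p in zip(y_true, y_pred))


def r2(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_true = fmean(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


# Made-up salaries, for illustration only.
actual = [50_000.0, 80_000.0, 120_000.0, 150_000.0]
perfect = list(actual)                        # every point on the identity line
baseline = [fmean(actual)] * len(actual)      # always predict the mean salary

print("perfect:  MSE =", mse(actual, perfect), " R2 =", r2(actual, perfect))
print("baseline: MSE =", mse(actual, baseline), " R2 =", r2(actual, baseline))
```

Predictions that sit exactly on the identity line give an R² of 1, while always predicting the mean salary gives an R² of 0, so the printed R² is a direct check on how closely the points follow the red line.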